Biostat 203B Homework 2
Due Feb 9 @ 11:59PM
Display machine information for reproducibility:
sessionInfo()
Load necessary libraries (you can add more as needed).
library(arrow)
library(data.table)
library(memuse)
library(pryr)
library(R.utils)
library(tidyverse)
library(readr)
library(dplyr)
Display memory information of your computer:
memuse::Sys.meminfo()
Totalram: 23.471 GiB
Freeram: 21.659 GiB
In this exercise, we explore various tools for ingesting the MIMIC-IV data introduced in homework 1.
Display the contents of MIMIC hosp and icu data folders:
ls -l ~/mimic/hosp/
total 4429876
-rwxrwxrwx 1 zhili zhili 15516088 Jan 5 2023 admissions.csv.gz
-rwxrwxrwx 1 zhili zhili 427468 Jan 5 2023 d_hcpcs.csv.gz
-rwxrwxrwx 1 zhili zhili 859438 Jan 5 2023 d_icd_diagnoses.csv.gz
-rwxrwxrwx 1 zhili zhili 578517 Jan 5 2023 d_icd_procedures.csv.gz
-rwxrwxrwx 1 zhili zhili 12900 Jan 5 2023 d_labitems.csv.gz
-rwxrwxrwx 1 zhili zhili 25070720 Jan 5 2023 diagnoses_icd.csv.gz
-rwxrwxrwx 1 zhili zhili 7426955 Jan 5 2023 drgcodes.csv.gz
-rwxrwxrwx 1 zhili zhili 508524623 Jan 5 2023 emar.csv.gz
-rwxrwxrwx 1 zhili zhili 471096030 Jan 5 2023 emar_detail.csv.gz
-rwxrwxrwx 1 zhili zhili 1767138 Jan 5 2023 hcpcsevents.csv.gz
-rwxrwxrwx 1 zhili zhili 1939088924 Jan 5 2023 labevents.csv.gz
-rwxrwxrwx 1 zhili zhili 96698496 Jan 5 2023 microbiologyevents.csv.gz
-rwxrwxrwx 1 zhili zhili 36124944 Jan 5 2023 omr.csv.gz
-rwxrwxrwx 1 zhili zhili 2312631 Jan 5 2023 patients.csv.gz
-rwxrwxrwx 1 zhili zhili 398753125 Jan 5 2023 pharmacy.csv.gz
-rwxrwxrwx 1 zhili zhili 498505135 Jan 5 2023 poe.csv.gz
-rwxrwxrwx 1 zhili zhili 25477219 Jan 5 2023 poe_detail.csv.gz
-rwxrwxrwx 1 zhili zhili 458817415 Jan 5 2023 prescriptions.csv.gz
-rwxrwxrwx 1 zhili zhili 6027067 Jan 5 2023 procedures_icd.csv.gz
-rwxrwxrwx 1 zhili zhili 122507 Jan 5 2023 provider.csv.gz
-rwxrwxrwx 1 zhili zhili 6781247 Jan 5 2023 services.csv.gz
-rwxrwxrwx 1 zhili zhili 36158338 Jan 5 2023 transfers.csv.gz
ls -l ~/mimic/icu/
total 3077984
-rwxrwxrwx 1 zhili zhili 35893 Jan 5 2023 caregiver.csv.gz
-rwxrwxrwx 1 zhili zhili 2467761053 Jan 5 2023 chartevents.csv.gz
-rwxrwxrwx 1 zhili zhili 57476 Jan 5 2023 d_items.csv.gz
-rwxrwxrwx 1 zhili zhili 45721062 Jan 5 2023 datetimeevents.csv.gz
-rwxrwxrwx 1 zhili zhili 2614571 Jan 5 2023 icustays.csv.gz
-rwxrwxrwx 1 zhili zhili 251962313 Jan 5 2023 ingredientevents.csv.gz
-rwxrwxrwx 1 zhili zhili 324218488 Jan 5 2023 inputevents.csv.gz
-rwxrwxrwx 1 zhili zhili 38747895 Jan 5 2023 outputevents.csv.gz
-rwxrwxrwx 1 zhili zhili 20717852 Jan 5 2023 procedureevents.csv.gz
Q1. read.csv (base R) vs read_csv (tidyverse) vs fread (data.table)
Q1.1 Speed, memory, and data types
There are quite a few utilities in R for reading plain text data files. Let us test the speed of reading a moderate sized compressed csv file, admissions.csv.gz, by three functions: read.csv in base R, read_csv in tidyverse, and fread in the data.table package.
Which function is fastest? Is there difference in the (default) parsed data types? How much memory does each resultant dataframe or tibble use? (Hint: system.time measures run times; pryr::object_size measures memory usage.)
Answer:
fread is the fastest, taking 0.747 s in total; read_csv is second at 0.846 s; and read.csv is the slowest at 2.598 s. The default parsed data types differ. Both fread and read_csv parse the time-related variables as date-time (abbreviated <dttm> in the output), while read.csv parses them as character (<chr>). On the other hand, read.csv and fread parse the numeric variables as integer, while read_csv parses them as double. fread uses the least memory at about 50 MB, followed by read_csv at around 55 MB; read.csv uses the most, about 158.7 MB.
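As a quick check on the claimed type differences, the parsed column classes can be compared side by side (a sketch, assuming the objects admissions1, admissions2, and admissions3 created by the chunks below):

```r
# Compare the default column classes chosen by each reader;
# class() can return a vector (e.g. c("POSIXct", "POSIXt")), so keep the first element
sapply(admissions1, function(x) class(x)[1])  # base R read.csv
sapply(admissions2, function(x) class(x)[1])  # readr read_csv
sapply(admissions3, function(x) class(x)[1])  # data.table fread
```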
system.time({
admissions1 <- read.csv("~/mimic/hosp/admissions.csv.gz")
})
system.time({
admissions2 <- read_csv("~/mimic/hosp/admissions.csv.gz")
})
system.time({
admissions3 <- fread("~/mimic/hosp/admissions.csv.gz")
})
as_tibble(admissions1)
as_tibble(admissions2)
as_tibble(admissions3)
pryr::object_size(admissions1)
pryr::object_size(admissions2)
pryr::object_size(admissions3)
Q1.2 User-supplied data types
Re-ingest admissions.csv.gz by indicating appropriate column data types in read_csv. Does the run time change? How much memory does the result tibble use? (Hint: col_types argument in read_csv.)
Answer: The run time decreased from 0.846s to 0.706s, and the memory usage shrank from 55MB to 38.06MB.
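A convenient starting point for writing the col_types argument is readr's own guessed specification, which can be printed and then edited by hand (a sketch; the n_max value is illustrative, used only to keep the guessing pass fast):

```r
library(readr)
# Print the column spec readr guessed from the first rows;
# the printed cols(...) call can be copied and edited into col_types
spec(read_csv("~/mimic/hosp/admissions.csv.gz", n_max = 1000))
```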
system.time({
admissions_user <- read_csv("~/mimic/hosp/admissions.csv.gz",
col_types = cols(
subject_id = col_integer(),
hadm_id = col_integer(),
admission_type = col_factor(),
admission_location = col_factor(),
discharge_location = col_factor(),
insurance = col_factor(),
language = col_factor(),
marital_status = col_factor(),
race = col_factor(),
hospital_expire_flag = col_logical()
)
)
})
user system elapsed
0.619 0.190 0.858
pryr::object_size(admissions_user)
38.06 MB
Q2. Ingest big data files
Let us focus on a bigger file, labevents.csv.gz, which is about 125x bigger than admissions.csv.gz.
ls -hl ~/mimic/hosp/labevents.csv.gz
Display the first 10 lines of this file.
zcat < ~/mimic/hosp/labevents.csv.gz | head -10
Q2.1 Ingest labevents.csv.gz by read_csv
Try to ingest labevents.csv.gz using read_csv. What happens? If it takes more than 5 minutes on your computer, then abort the program and report your findings.
Answer: The read_csv function took more than 5 minutes on my computer. No error was raised; the code simply kept running and no result was produced. I assigned 24 GB of my 32 GB of RAM to WSL2 (16 GB by default), but memory usage still climbed to 100% when I tried to ingest labevents.csv.gz using read_csv. As Dr. Zhou mentioned in class, this is likely because read_csv tries to read the entire file into memory at once, which is not feasible for a file of this size.
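If the full file had to be processed with readr alone, one hedged workaround is chunked reading, which keeps only one block of rows in memory at a time (a sketch; the chunk size and the row-counting callback are illustrative, not part of the required solution):

```r
library(readr)
# Stream the file in 1-million-row chunks, keeping only a running row count;
# each chunk is discarded after the callback runs, so memory stays bounded
n_rows <- 0
read_csv_chunked(
  "~/mimic/hosp/labevents.csv.gz",
  callback = SideEffectChunkCallback$new(function(x, pos) {
    n_rows <<- n_rows + nrow(x)
  }),
  chunk_size = 1e6
)
n_rows
```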
labevents <- read_csv("~/mimic/hosp/labevents.csv.gz")
Q2.2 Ingest selected columns of labevents.csv.gz by read_csv
Try to ingest only columns subject_id, itemid, charttime, and valuenum in labevents.csv.gz using read_csv. Does this solve the ingestion issue? (Hint: col_select argument in read_csv.)
Answer: The data was successfully read into the R session, taking about 3 minutes in total. This likely worked because the subset of columns read_csv had to ingest was small enough to fit into RAM.
labevents_sub <- read_csv("~/mimic/hosp/labevents.csv.gz",
col_select = c("subject_id", "itemid",
"charttime", "valuenum"))Q2.3 Ingest subset of labevents.csv.gz
Our first strategy to handle this big data file is to make a subset of the labevents data. Read the MIMIC documentation for the content in data file labevents.csv.
In later exercises, we will only be interested in the following lab items: creatinine (50912), potassium (50971), sodium (50983), chloride (50902), bicarbonate (50882), hematocrit (51221), white blood cell count (51301), and glucose (50931) and the following columns: subject_id, itemid, charttime, valuenum. Write a Bash command to extract these columns and rows from labevents.csv.gz and save the result to a new file labevents_filtered.csv.gz in the current working directory. (Hint: use zcat < to pipe the output of labevents.csv.gz to awk and then to gzip to compress the output. To save render time, put #| eval: false at the beginning of this code chunk.)
Display the first 10 lines of the new file labevents_filtered.csv.gz. How many lines are in this new file? How long does it take read_csv to ingest labevents_filtered.csv.gz?
Answer: It took approximately 110 seconds to ingest labevents_filtered.csv.gz using read_csv. The new file labevents_filtered.csv.gz contains 24,855,909 lines.
zcat < ~/mimic/hosp/labevents.csv.gz | awk -F ',' '($5 == 50912 ||\
$5 == 50971 || $5 == 50983 || $5 == 50902 || $5 == 50882 || $5 == 51221 ||\
$5 == 51301 || $5 == 50931) {print $2","$5","$7","$10}' |\
gzip > labevents_filtered.csv.gz
zcat < labevents_filtered.csv.gz | head -10
10000032,50882,2180-03-23 11:51:00,27
10000032,50902,2180-03-23 11:51:00,101
10000032,50912,2180-03-23 11:51:00,0.4
10000032,50971,2180-03-23 11:51:00,3.7
10000032,50983,2180-03-23 11:51:00,136
10000032,50931,2180-03-23 11:51:00,95
10000032,51221,2180-03-23 11:51:00,45.4
10000032,51301,2180-03-23 11:51:00,3
10000032,51221,2180-05-06 22:25:00,42.6
10000032,51301,2180-05-06 22:25:00,5
zcat < labevents_filtered.csv.gz | wc -l
24855909
Q2.4 Ingest labevents.csv by Apache Arrow
Our second strategy is to use Apache Arrow for larger-than-memory data analytics. Unfortunately Arrow does not work with gz files directly. First decompress labevents.csv.gz to labevents.csv and put it in the current working directory. To save render time, put #| eval: false at the beginning of this code chunk.
Then use arrow::open_dataset to ingest labevents.csv, select columns, and filter itemid as in Q2.3. How long does the ingest+select+filter process take? Display the number of rows and the first 10 rows of the result tibble, and make sure they match those in Q2.3. (Hint: use dplyr verbs for selecting columns and filtering rows.)
Write a few sentences to explain what is Apache Arrow. Imagine you want to explain it to a layman in an elevator.
Answer:
The ingest+select+filter process took approximately 0.02 seconds, and the conversion from an arrow dataset to a tibble took approximately 17 seconds. There are 24,855,909 rows in the result tibble, the first 10 rows of which are displayed below.
Apache Arrow is an in-memory data format that works with a variety of programming languages and lets systems exchange data at very low cost. By organizing data in a columnar format in memory, it avoids the unnecessary serialization and deserialization that would otherwise cost a lot of time and resources.
P.S. Screenshots are included to show the results from which I got these numbers, since the rendered file may differ slightly from the first time I ran the code.
gzip -dk ~/mimic/hosp/labevents.csv.gz
system.time({
labevents_arrow <- arrow::open_csv_dataset("labevents.csv") %>%
select(subject_id, itemid, charttime, valuenum) %>%
filter(itemid %in% c(50912, 50971, 50983, 50902, 50882, 51221, 51301, 50931))
})
user system elapsed
0.049 0.010 0.057
system.time({
labevents_tbbl <- labevents_arrow %>% collect()
})
user system elapsed
19.915 6.768 17.585
dim(labevents_tbbl)
[1] 24855909 4
head(labevents_tbbl, 10)
# A tibble: 10 × 4
subject_id itemid charttime valuenum
<int> <int> <dttm> <dbl>
1 10000032 50882 2180-03-23 04:51:00 27
2 10000032 50902 2180-03-23 04:51:00 101
3 10000032 50912 2180-03-23 04:51:00 0.4
4 10000032 50971 2180-03-23 04:51:00 3.7
5 10000032 50983 2180-03-23 04:51:00 136
6 10000032 50931 2180-03-23 04:51:00 95
7 10000032 51221 2180-03-23 04:51:00 45.4
8 10000032 51301 2180-03-23 04:51:00 3
9 10000032 51221 2180-05-06 15:25:00 42.6
10 10000032 51301 2180-05-06 15:25:00 5
Q2.5 Compress labevents.csv to Parquet format and ingest/select/filter
Re-write the csv file labevents.csv in the binary Parquet format (Hint: arrow::write_dataset.) How large is the Parquet file(s)? How long does the ingest+select+filter process of the Parquet file(s) take? Display the number of rows and the first 10 rows of the result tibble and make sure they match those in Q2.3. (Hint: use dplyr verbs for selecting columns and filtering rows.)
Write a few sentences to explain what is the Parquet format. Imagine you want to explain it to a layman in an elevator.
Answer:
The Parquet file is about 1.9 GB in size. The ingest+select+filter process on the Parquet file took approximately 0.15 seconds, and the conversion from an arrow dataset to a tibble took approximately 2.7 seconds. There are 24,855,909 rows in the result tibble, the first 10 rows of which are displayed below.
The Parquet format is a columnar storage format optimized for reading and writing large data sets. It is efficient for both storage and processing, and it is especially good for complex data types and nested data structures. Files stored in the Parquet format are highly compressed and can be read and written in parallel, which makes them ideal for big data applications.
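The reported file size can also be double-checked from within R (a sketch, assuming the Parquet dataset directory ./labevents.parquet created by the chunk below; arrow::write_dataset may split the data into several part files, so their sizes are summed):

```r
# Sum the sizes of all Parquet part files in the dataset directory
pqt_files <- list.files("labevents.parquet", full.names = TRUE, recursive = TRUE)
sum(file.size(pqt_files)) / 1e9  # total size in GB
```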
arrow::open_csv_dataset("labevents.csv") %>%
arrow::write_dataset(format = "parquet", path = "./labevents.parquet")
system.time({
labevents_pqt <- arrow::open_dataset("labevents.parquet") %>%
select(subject_id, itemid, charttime, valuenum) %>%
filter(itemid %in% c(50912, 50971, 50983, 50902, 50882, 51221, 51301, 50931))
})
user system elapsed
0.156 0.042 0.200
system.time({
labevents_pqt_tbbl <- labevents_pqt %>% collect()
})
user system elapsed
9.439 4.724 2.648
dim(labevents_pqt_tbbl)
[1] 24855909 4
head(labevents_pqt_tbbl, 10)
# A tibble: 10 × 4
subject_id itemid charttime valuenum
<int> <int> <dttm> <dbl>
1 10000032 50882 2180-03-23 04:51:00 27
2 10000032 50902 2180-03-23 04:51:00 101
3 10000032 50912 2180-03-23 04:51:00 0.4
4 10000032 50971 2180-03-23 04:51:00 3.7
5 10000032 50983 2180-03-23 04:51:00 136
6 10000032 50931 2180-03-23 04:51:00 95
7 10000032 51221 2180-03-23 04:51:00 45.4
8 10000032 51301 2180-03-23 04:51:00 3
9 10000032 51221 2180-05-06 15:25:00 42.6
10 10000032 51301 2180-05-06 15:25:00 5
Q2.6 DuckDB
Ingest the Parquet file, convert it to a DuckDB table by arrow::to_duckdb, select columns, and filter rows as in Q2.5. How long does the ingest+convert+select+filter process take? Display the number of rows and the first 10 rows of the result tibble and make sure they match those in Q2.3. (Hint: use dplyr verbs for selecting columns and filtering rows.)
Write a few sentences to explain what is DuckDB. Imagine you want to explain it to a layman in an elevator.
Answer:
The ingest+convert+select+filter process took approximately 0.17 seconds. There are 24,855,909 rows in the result tibble, the first 10 rows of which are displayed below.
DuckDB is an in-process (embedded) database management system designed primarily for analytical query workloads. Because it runs inside the host application, it integrates easily with languages like R and Python. DuckDB processes and stores data in a columnar format, which makes it very efficient for analytical queries.
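For comparison, the same select+filter can be expressed as plain SQL against a DuckDB connection, bypassing the arrow::to_duckdb bridge (a sketch; the glob pattern and connection handling are illustrative, not part of the original solution):

```r
library(duckdb)  # also loads DBI

con <- dbConnect(duckdb())
# Query the Parquet part files directly with DuckDB's read_parquet
res <- dbGetQuery(con, "
  SELECT subject_id, itemid, charttime, valuenum
  FROM read_parquet('labevents.parquet/*.parquet')
  WHERE itemid IN (50912, 50971, 50983, 50902, 50882, 51221, 51301, 50931)
  LIMIT 10
")
dbDisconnect(con, shutdown = TRUE)
```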
library(duckdb)
system.time({
labevents_pqt_duckdb <- arrow::open_dataset("labevents.parquet") %>%
arrow::to_duckdb(table = "labevents_pqt_duckdb") %>%
select(subject_id, itemid, charttime, valuenum) %>%
filter(itemid %in% c(50912, 50971, 50983, 50902, 50882, 51221, 51301, 50931))
})
user system elapsed
0.242 0.110 0.319
labevents_pqt_duckdb <- labevents_pqt_duckdb %>%
arrange(subject_id, charttime) %>%
as_tibble()
dim(labevents_pqt_duckdb)
[1] 24855909 4
head(labevents_pqt_duckdb, 10)
# A tibble: 10 × 4
subject_id itemid charttime valuenum
<dbl> <dbl> <dttm> <dbl>
1 10000032 50882 2180-03-23 11:51:00 27
2 10000032 50902 2180-03-23 11:51:00 101
3 10000032 50912 2180-03-23 11:51:00 0.4
4 10000032 50971 2180-03-23 11:51:00 3.7
5 10000032 50983 2180-03-23 11:51:00 136
6 10000032 50931 2180-03-23 11:51:00 95
7 10000032 51221 2180-03-23 11:51:00 45.4
8 10000032 51301 2180-03-23 11:51:00 3
9 10000032 51221 2180-05-06 22:25:00 42.6
10 10000032 51301 2180-05-06 22:25:00 5
Q3. Ingest and filter chartevents.csv.gz
chartevents.csv.gz contains all the charted data available for a patient. During their ICU stay, the primary repository of a patient’s information is their electronic chart. The itemid variable indicates a single measurement type in the database. The value variable is the value measured for itemid. The first 10 lines of chartevents.csv.gz are
zcat < ~/mimic/icu/chartevents.csv.gz | head -10
subject_id,hadm_id,stay_id,caregiver_id,charttime,storetime,itemid,value,valuenum,valueuom,warning
10000032,29079034,39553978,47007,2180-07-23 21:01:00,2180-07-23 22:15:00,220179,82,82,mmHg,0
10000032,29079034,39553978,47007,2180-07-23 21:01:00,2180-07-23 22:15:00,220180,59,59,mmHg,0
10000032,29079034,39553978,47007,2180-07-23 21:01:00,2180-07-23 22:15:00,220181,63,63,mmHg,0
10000032,29079034,39553978,47007,2180-07-23 22:00:00,2180-07-23 22:15:00,220045,94,94,bpm,0
10000032,29079034,39553978,47007,2180-07-23 22:00:00,2180-07-23 22:15:00,220179,85,85,mmHg,0
10000032,29079034,39553978,47007,2180-07-23 22:00:00,2180-07-23 22:15:00,220180,55,55,mmHg,0
10000032,29079034,39553978,47007,2180-07-23 22:00:00,2180-07-23 22:15:00,220181,62,62,mmHg,0
10000032,29079034,39553978,47007,2180-07-23 22:00:00,2180-07-23 22:15:00,220210,20,20,insp/min,0
10000032,29079034,39553978,47007,2180-07-23 22:00:00,2180-07-23 22:15:00,220277,95,95,%,0
d_items.csv.gz is the dictionary for the itemid in chartevents.csv.gz.
zcat < ~/mimic/icu/d_items.csv.gz | head -10
itemid,label,abbreviation,linksto,category,unitname,param_type,lownormalvalue,highnormalvalue
220001,Problem List,Problem List,chartevents,General,,Text,,
220003,ICU Admission date,ICU Admission date,datetimeevents,ADT,,Date and time,,
220045,Heart Rate,HR,chartevents,Routine Vital Signs,bpm,Numeric,,
220046,Heart rate Alarm - High,HR Alarm - High,chartevents,Alarms,bpm,Numeric,,
220047,Heart Rate Alarm - Low,HR Alarm - Low,chartevents,Alarms,bpm,Numeric,,
220048,Heart Rhythm,Heart Rhythm,chartevents,Routine Vital Signs,,Text,,
220050,Arterial Blood Pressure systolic,ABPs,chartevents,Routine Vital Signs,mmHg,Numeric,90,140
220051,Arterial Blood Pressure diastolic,ABPd,chartevents,Routine Vital Signs,mmHg,Numeric,60,90
220052,Arterial Blood Pressure mean,ABPm,chartevents,Routine Vital Signs,mmHg,Numeric,,
In later exercises, we are interested in the vitals for ICU patients: heart rate (220045), mean non-invasive blood pressure (220181), systolic non-invasive blood pressure (220179), body temperature in Fahrenheit (223761), and respiratory rate (220210). Retrieve a subset of chartevents.csv.gz containing only these items, using your favorite method from Q2.
Document the steps and show code. Display the number of rows and the first 10 rows of the result tibble.
Answer: There are 22,502,319 rows in the filtered tibble. The first 10 rows of the result tibble are displayed below.
gzip -dk ~/mimic/icu/chartevents.csv.gz
chartevents_fltr <- arrow::open_csv_dataset("chartevents.csv") %>%
filter(itemid %in% c(220045, 220181, 220179, 223761,
220210)) %>%
collect()
dim(chartevents_fltr)
[1] 22502319 11
head(chartevents_fltr, 10)
# A tibble: 10 × 11
subject_id hadm_id stay_id caregiver_id charttime
<int> <int> <int> <int> <dttm>
1 10000032 29079034 39553978 47007 2180-07-23 14:01:00
2 10000032 29079034 39553978 47007 2180-07-23 14:01:00
3 10000032 29079034 39553978 47007 2180-07-23 15:00:00
4 10000032 29079034 39553978 47007 2180-07-23 15:00:00
5 10000032 29079034 39553978 47007 2180-07-23 15:00:00
6 10000032 29079034 39553978 47007 2180-07-23 15:00:00
7 10000032 29079034 39553978 66056 2180-07-23 12:00:00
8 10000032 29079034 39553978 66056 2180-07-23 12:00:00
9 10000032 29079034 39553978 66056 2180-07-23 12:00:00
10 10000032 29079034 39553978 66056 2180-07-23 12:00:00
# ℹ 6 more variables: storetime <dttm>, itemid <int>, value <chr>,
# valuenum <dbl>, valueuom <chr>, warning <int>
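As a sanity check on which vitals were kept, the retained itemid codes can be matched against the d_items dictionary (a sketch, assuming the chartevents_fltr tibble created above):

```r
library(readr)
library(dplyr)

# Attach human-readable labels from the d_items dictionary to each retained itemid
d_items <- read_csv("~/mimic/icu/d_items.csv.gz")
chartevents_fltr %>%
  count(itemid) %>%
  left_join(d_items %>% select(itemid, label), by = "itemid")
```

The counts per itemid should sum to the 22,502,319 rows reported above, and the labels should match the five vitals listed in the prompt.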